Clustering with Missing Features: A Penalized Dissimilarity Measure based approach

Authors

  • Shounak Datta
  • Supritam Bhattacharjee
  • Swagatam Das
Abstract

Many real-world clustering problems are plagued by incomplete data characterized by missing or absent features for some or all of the data instances. Traditional clustering methods cannot be directly applied to such data without preprocessing by imputation or marginalization techniques. In this article, we put forth the concept of Penalized Dissimilarity Measures which estimate the actual distance between two data points (the distance between them if they were to be fully observed) by adding a penalty to the distance due to the observed features common to both the instances. We then propose such a dissimilarity measure called the Feature Weighted Penalty based Dissimilarity (FWPD) measure. Using the proposed dissimilarity measure, we also modify the traditional k-means clustering algorithm and the standard hierarchical agglomerative clustering techniques so as to make them directly applicable to datasets with missing features. We present time complexity analyses for these new techniques and also present a detailed analysis showing that the new FWPD based k-means algorithm converges to a local optimum within a finite number of iterations. We have also conducted extensive experiments on various benchmark datasets showing that the proposed clustering techniques have generally better results compared to some of the popular imputation methods which are commonly used to handle such incomplete data. We have appended a possible extension of the proposed dissimilarity measure to the case of absent features (where the unobserved features are known to be non-existent).
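
The idea of adding a penalty to the distance over commonly observed features can be illustrated with a minimal sketch. The abstract does not give the FWPD formula, so everything below is an assumption made for illustration: the function name `penalized_dissimilarity`, the mixing parameter `alpha`, the per-feature `weights`, the NaN encoding of missing values, and the choice of Euclidean distance. It should not be read as the authors' exact measure.

```python
import math

def penalized_dissimilarity(x, y, alpha=0.5, weights=None):
    """Illustrative penalized dissimilarity between two partially observed points.

    Missing values are encoded as float('nan'). `alpha` and `weights` are
    hypothetical parameters chosen for this sketch, not taken from the abstract.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("points must have the same number of features")
    if weights is None:
        weights = [1.0] * n  # assumption: every feature carries equal weight

    # Indices of features observed in both points
    common = [i for i in range(n)
              if not math.isnan(x[i]) and not math.isnan(y[i])]

    # Euclidean distance restricted to the commonly observed features
    observed_dist = math.sqrt(sum((x[i] - y[i]) ** 2 for i in common))

    # Penalty: weighted share of features that are NOT observed in both points
    total_w = sum(weights)
    missing_w = total_w - sum(weights[i] for i in common)
    penalty = missing_w / total_w

    # Assumed combination: convex mix of observed distance and penalty
    return (1 - alpha) * observed_dist + alpha * penalty


# Usage: the third feature of `a` is missing, so only two features are compared
a = [0.0, 0.0, float("nan")]
b = [3.0, 4.0, 1.0]
d = penalized_dissimilarity(a, b, alpha=0.5)
```

In this sketch, two points that share few observed features receive a larger penalty, so a small distance computed over only one or two common features is not mistaken for genuine closeness. In practice the observed distance would likely need to be normalized so that it is on a scale comparable to the penalty term.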

Similar resources

Clustering Gene Expression Data Using Random Forest Dissimilarity

Background: The clustering of gene expression data plays an important role in the diagnosis and treatment of cancer. These kinds of data typically involve a large number of variables (genes) in comparison with the number of samples (patients). Many clustering methods have been built based on the dissimilarity among observations, which is calculated by a distance function. As increa...

K Important Neighbors: A Novel Approach to Binary Classification in High Dimensional Data

K nearest neighbors (KNN) is known as one of the simplest nonparametric classifiers, but in high-dimensional settings the accuracy of KNN is affected by nuisance features. In this study, we propose the K important neighbors (KIN) as a novel approach for binary classification in high-dimensional problems. To avoid the curse of dimensionality, we implemented smoothly clipped absolute deviation (SCAD...

Clustering with Intelligent Linex k-Means

The intelligent LINEX k-means clustering is a generalization of k-means clustering in which the number of clusters and their related centroids can be determined while the LINEX loss function is used as the dissimilarity measure. Therefore, the selection of the centers in each cluster is not random. Choosing the LINEX dissimilarity measure helps the researcher to overestimate or undere...

Clustering Algorithm for Incomplete Data Sets with Mixed Numeric and Categorical Attributes

The traditional k-prototypes algorithm is well suited to clustering data with mixed numeric and categorical attributes, but it is limited to complete data. In order to handle incomplete data sets with missing values, an improved k-prototypes algorithm is proposed in this paper, which employs a new dissimilarity measure for incomplete data sets with mixed numeric and categorical attributes and a...

Redefining the Bayesian information criterion for speaker diarisation

A novel approach to the Bayesian Information Criterion (BIC) is introduced. The new criterion redefines the penalty terms of the BIC, such that each parameter is penalized with the effective sample size it is trained with. Contrary to Local-BIC, the proposed criterion scores overall clustering hypotheses and therefore is not restricted to hierarchical clustering algorithms. Contrary to Global-BIC,...

Journal title:
  • CoRR

Volume: abs/1604.06602

Publication date: 2016